4 research outputs found

    Selection of representative set of membrane proteins of known structure: development of improved algorithms using the random model concept

    Determining the structure of membrane proteins experimentally is considerably more difficult than determining that of soluble proteins. To develop a reliable model for predicting protein structure, the model must be optimised on the largest possible (representative) set of membrane proteins of known structure with mutual similarities below 30%. Existing algorithms for selecting representative sets of alpha-type integral membrane proteins do not use information about structural complexity, although models are expected to be more reliable if they are developed on a set of proteins with more complex structures. Therefore, the concept of a random model with two secondary structures was introduced, and it was observed that the expression for estimating its accuracy is related to the complexity of the structure. The concepts of the binomial and segmental random models were then developed, and expressions were derived for the number of possible realizations of the model protein structure, which shows an analogy with entropy. The segmental random model corresponds to the structure of membrane proteins in which several adjacent amino acids form segments of regular alpha secondary structure. The number of realizations of the model structure in the segmental random model is related to structural complexity and correlates significantly with the number of transmembrane segments. This quantity is built into the developed algorithms, the best of which is based on an original analysis of the number of common neighbours between proteins in the initial set. Applying these algorithms to databases of membrane proteins of known structure yields larger representative sets with significantly more complex structures than those published in the literature.
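To make the idea of complexity-aware selection concrete, here is a minimal sketch of a generic greedy pruning scheme under a 30% similarity threshold. It is not the algorithm developed in the work: the common-neighbour analysis and the random-model expressions are not specified in the abstract and are not reproduced here. The function name, the `similarity` and `complexity` inputs, and the use of the transmembrane-segment count as a complexity proxy are all assumptions made for illustration.

```python
def select_representative(proteins, similarity, complexity, threshold=30.0):
    """Greedy sketch (hypothetical, not the authors' method).

    proteins   -- list of protein identifiers
    similarity -- dict mapping frozenset({a, b}) to pairwise similarity in %
    complexity -- dict mapping identifier to a complexity score
                  (assumed here: number of transmembrane segments)
    threshold  -- maximum allowed mutual similarity in %
    """
    # Visit the most complex structures first; break ties by name.
    ordered = sorted(proteins, key=lambda name: (-complexity[name], name))
    kept = []
    for candidate in ordered:
        # Keep the candidate only if it is dissimilar to everything kept so far.
        if all(similarity[frozenset((candidate, other))] < threshold
               for other in kept):
            kept.append(candidate)
    return kept


# Invented toy data, for illustration only.
toy_proteins = ["A", "B", "C", "D"]
toy_complexity = {"A": 7, "B": 12, "C": 1, "D": 4}
toy_similarity = {
    frozenset(("A", "B")): 85.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("A", "D")): 20.0,
    frozenset(("B", "C")): 15.0,
    frozenset(("B", "D")): 25.0,
    frozenset(("C", "D")): 40.0,
}
representative = select_representative(toy_proteins, toy_similarity, toy_complexity)
```

With this toy input, the more complex protein of each similar pair is retained, which is the general behaviour the abstract argues for; the order in which candidates are visited is the main design choice in a greedy scheme of this kind.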

    The Difference Between the Accuracy of Real and the Corresponding Random Model is a Useful Parameter for Validation of Two-State Classification Model Quality

    The simplest and most commonly used measure for assessing the quality of a classification model is the parameter Q2 = 100 (p + n) / N (%), called the classification accuracy, where p and n are the total numbers of correctly predicted compounds in the first and in the second class, respectively, and N is the total number of elements (compounds) in the data set. Moreover, the most probable accuracy that can be obtained by a random model is calculated for a two-state model by the formula Q2,rnd = 100 [(p + u) (p + o) + (n + u) (n + o)] / N^2 (%), where u and o are the total numbers of under-predictions (when class 1 is predicted by the model as class 2) and over-predictions (when class 2 is predicted by the model as class 1) in the data set, respectively. Finally, the difference between these two parameters, ΔQ2 = Q2 - Q2,rnd, is introduced, and it is suggested that ΔQ2 be computed and reported for each two-state classification model to assess its contribution over the accuracy of the corresponding random model. When the data set is ideally balanced, with the same number of elements in both classes, the two-state classification problem is the most difficult, with maximal Q2 = 100 % and Q2,rnd = 50 %, giving the maximal ΔQ2 = 50 %. The usefulness of the ΔQ2 parameter is illustrated in a comparative analysis of two-class classification models from the literature for prediction of the secondary structure of membrane proteins and of several quantitative structure-property models. The real contributions of these models over the random level of accuracy are calculated, and their ΔQ2 values are compared with one another and with the value of ΔQ2 (= 50 %) for the most difficult two-state classification model.
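The three quantities defined in this abstract can be computed directly from the four counts p, n, u and o. The sketch below assumes that N = p + n + u + o, which follows from the definitions but is not stated explicitly in the abstract; the function name is hypothetical.

```python
def two_state_accuracy(p, n, u, o):
    """Return (Q2, Q2_rnd, delta_Q2) in percent for a two-state model.

    p, n -- correctly predicted elements of class 1 and class 2
    u    -- under-predictions (class 1 predicted as class 2)
    o    -- over-predictions (class 2 predicted as class 1)
    Assumption: the data set size is N = p + n + u + o.
    """
    total = p + n + u + o
    if total == 0:
        raise ValueError("the data set must contain at least one element")
    q2 = 100.0 * (p + n) / total
    # (p + u) and (n + o) are the actual class sizes;
    # (p + o) and (n + u) are the predicted class sizes.
    q2_rnd = 100.0 * ((p + u) * (p + o) + (n + u) * (n + o)) / total ** 2
    return q2, q2_rnd, q2 - q2_rnd


# Perfect model on an ideally balanced set: the case discussed in the abstract.
balanced = two_state_accuracy(p=50, n=50, u=0, o=0)

# Invented counts: a model that assigns everything to class 1 on a 90/10 split.
majority_only = two_state_accuracy(p=90, n=0, u=0, o=10)
```

In the balanced case the function returns Q2 = 100 %, Q2,rnd = 50 % and ΔQ2 = 50 %, matching the values quoted in the abstract. In the second case Q2 is 90 % but ΔQ2 is 0, which illustrates why the difference is more informative than the raw accuracy on unbalanced data.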

    Estimation of Random Accuracy and its Use in Validation of Predictive Quality of Classification Models within Predictive Challenges

    Shortcomings of Pearson's correlation coefficient as a measure of the accuracy of properties predicted by a model are analysed. Here we discuss two cases that often occur when a model is applied to predict properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systematic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and can have an arbitrary and uneven distribution of the measured values of the target variable, which are not known in advance. Under these conditions, the correlation coefficient can be an overoptimistic measure of the agreement between predicted values and the corresponding experimental values, and can lead to an overly optimistic conclusion about the predictive ability of the model. Because of these shortcomings, the standard error of prediction (root-mean-square error) is suggested as a better measure of the predictive capability of a model. For classification models, the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, while also having an obvious interpretation.
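The first shortcoming can be seen in a few lines of code: adding a constant offset to every prediction leaves Pearson's correlation coefficient unchanged, while the root-mean-square error grows. The sketch below uses invented numbers and hypothetical helper names, and relies only on the Python standard library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, predicted):
    """Root-mean-square error of prediction."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Invented example values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
unbiased = [1.1, 1.9, 3.2, 3.8, 5.0]
biased = [value + 2.0 for value in unbiased]  # constant systematic offset

r_unbiased = pearson_r(observed, unbiased)
r_biased = pearson_r(observed, biased)
rmse_unbiased = rmse(observed, unbiased)
rmse_biased = rmse(observed, biased)
```

Here `r_unbiased` and `r_biased` agree to within floating-point rounding, whereas `rmse_biased` is roughly 2 units against about 0.14 for the unbiased predictions, so only the RMSE registers the systematic error.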